17 research outputs found

    Five Tales of Random Forest Regression

    We present a set of variations on the theme of Random Forest regression: two applications to the problem of estimating galactic distances from photometry, which produce results comparable to or better than all other current approaches to the problem; an extension of the methodology that produces error-distribution variance estimates for individual regression estimates, a property that appears to be unique among non-parametric regression estimators; an exponential asymptotic improvement in training speed over the current de facto standard implementation, derived from a theoretical model of the training process combined with competent software engineering; a massively parallel implementation of the regression algorithm for a GPGPU cluster integrated with a distributed database management system, resulting in a fast round-trip ingest-analyze-archive procedure on a system with total power consumption under 1 kW; and a novel theoretical comparison of the methodology with kernel regression, relating the Random Forest bootstrap sample size to the kernel regression bandwidth parameter and yielding a novel extension of the Random Forest methodology that offers lower mean-squared error than the standard methodology.

    The Fifth Data Release of the Sloan Digital Sky Survey

    This paper describes the Fifth Data Release (DR5) of the Sloan Digital Sky Survey (SDSS). DR5 includes all survey quality data taken through June 2005 and represents the completion of the SDSS-I project (whose successor, SDSS-II, will continue through mid-2008). It includes five-band photometric data for 217 million objects selected over 8000 square degrees, and 1,048,960 spectra of galaxies, quasars, and stars selected from 5713 square degrees of that imaging data. These numbers represent a roughly 20% increment over those of the Fourth Data Release; all the data from previous data releases are included in the present release. In addition to "standard" SDSS observations, DR5 includes repeat scans of the southern equatorial stripe, imaging scans across M31 and the core of the Perseus cluster of galaxies, and the first spectroscopic data from SEGUE, a survey to explore the kinematics and chemical evolution of the Galaxy. The catalog database incorporates several new features, including photometric redshifts of galaxies, tables of matched objects in overlap regions of the imaging survey, and tools that allow precise computations of survey geometry for statistical investigations. Comment: ApJ Supp, in press, October 2007. This paper describes DR5. The SDSS Sixth Data Release (DR6) is now public, available from http://www.sdss.or

    The Seventh Data Release of the Sloan Digital Sky Survey

    This paper describes the Seventh Data Release of the Sloan Digital Sky Survey (SDSS), marking the completion of the original goals of the SDSS and the end of the phase known as SDSS-II. It includes 11663 deg^2 of imaging data, with most of the roughly 2000 deg^2 increment over the previous data release lying in regions of low Galactic latitude. The catalog contains five-band photometry for 357 million distinct objects. The survey also includes repeat photometry over 250 deg^2 along the Celestial Equator in the Southern Galactic Cap. A coaddition of these data goes roughly two magnitudes fainter than the main survey. The spectroscopy is now complete over a contiguous area of 7500 deg^2 in the Northern Galactic Cap, closing the gap that was present in previous data releases. There are over 1.6 million spectra in total, including 930,000 galaxies, 120,000 quasars, and 460,000 stars. The data release includes improved stellar photometry at low Galactic latitude. The astrometry has all been recalibrated with the second version of the USNO CCD Astrograph Catalog (UCAC-2), reducing the rms statistical errors at the bright end to 45 milli-arcseconds per coordinate. A systematic error in bright galaxy photometry is less severe than previously reported for the majority of galaxies. Finally, we describe a series of improvements to the spectroscopic reductions, including better flat-fielding and improved wavelength calibration at the blue end, better processing of objects with extremely strong narrow emission lines, and an improved determination of stellar metallicities. (Abridged) Comment: 20 pages, 10 embedded figures. Accepted to ApJS after minor correction

    The Second Data Release of the Sloan Digital Sky Survey

    The Sloan Digital Sky Survey (SDSS) has validated and made publicly available its Second Data Release. This data release consists of 3324 deg^2 of five-band (ugriz) imaging data with photometry for over 88 million unique objects, 367,360 spectra of galaxies, quasars, stars, and calibrating blank sky patches selected over 2627 deg^2 of this area, and tables of measured parameters from these data. The imaging data reach a depth of r ≈ 22.2 (95% completeness limit for point sources) and are photometrically and astrometrically calibrated to 2% rms and 100 mas rms per coordinate, respectively. The imaging data have all been processed through a new version of the SDSS imaging pipeline, in which the most important improvement since the last data release is fixing an error in the model fits to each object. The result is that model magnitudes are now a good proxy for point-spread function magnitudes for point sources, and Petrosian magnitudes for extended sources. The spectroscopy extends from 3800 to 9200 Å at a resolution of 2000. The spectroscopic software now repairs a systematic error in the radial velocities of certain types of stars and has substantially improved spectrophotometry. All data included in the SDSS Early Data Release and First Data Release are reprocessed with the improved pipelines and included in the Second Data Release. Further characteristics of the data are described, as are the data products themselves and the tools for accessing them.

    Random Forests for Photometric Redshifts (doi:10.1088/0004-637X/712/1/511)

    The main challenge today in photometric redshift estimation is not in the accuracy but in understanding the uncertainties. We introduce an empirical method based on Random Forests to address these issues. The training algorithm builds a set of optimal decision trees on subsets of the available spectroscopic sample, which provide independent constraints on the redshift of each galaxy. The combined forest estimates have intriguing statistical properties, notable among which are Gaussian errors. We demonstrate the power of our approach on multi-color measurement